Introduction

This tutorial has two major aims: The first one is to show the general workflow of how land cover classifications (or similar tasks) based on satellite data can be performed in R using machine learning algorithms. The second important aim is to show how to assess the area to which a spatial prediction model can be applied (“Area of applicability”, AOA). This is relevant because in spatial predictive mapping, models are often applied to make predictions far beyond sampling locations (i.e. field observarions used to map a variable even on a global scale), where new locations might considerably differ in their environmental properties. However, areas in the predictor space without support of training data are problematic. The model has no knowledge about these environments and predictions for such areas have to be considered highly uncertain.

Prediction task

The example prediction task is to perfom a supervised land cover classification for the Münster in Germany. The dataset to do this includes selected spectral channels of a Sentinel-2 scene. As resposne (reference/ground truth) we use digitized polygons that were created on the basis of expert knowledge.

How to start

For this tutorial we need the terra package for processing of the satellite data as well as the caret package as a wrapper for machine learning (here: randomForest) algorithms. Sf is used for handling of the training data available as vector data (polygons). CAST will be used to account for spatial dependencies during model validation as well as for the estimation of the AOA.

rm(list=ls())
#major required packages:
library(terra)
library(sf)
library(caret)
library(mapview)
library(CAST)
library(tmap)

Data preparation

Load and explore the data

To start with, let’s load and explore the remote sensing raster data as well as the vector data that include the training sites.

Raster data (predictor variables)

sen_ms <- rast("data/sen_muenster.tif")
print(sen_ms)
## class       : SpatRaster 
## dimensions  : 664, 798, 7  (nrow, ncol, nlyr)
## resolution  : 10, 10  (x, y)
## extent      : 400620, 408600, 5753560, 5760200  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=32 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## source      : sen_muenster.tif 
## names       :     B02,     B03,     B04,   B08,        B06,        B07, ... 
## min values  :    37.5,   211.5,     1.0,     1,   313.1562,   279.8438, ... 
## max values  : 14582.5, 15787.5, 18767.5, 19929, 10972.9688, 11645.3438, ...

The multi-band raster contains a subset of the optical data from Sentinel-2 (see band information here: https://en.wikipedia.org/wiki/Sentinel-2) given in scaled reflectances (B02-B11). Let’s plot the multi-band raster to get an idea how the variables look like.

plot(sen_ms)

### true color and false color composites
plotRGB(sen_ms,r=3,g=2,b=1,stretch="lin")

plotRGB(sen_ms,r=4,g=3,b=2,stretch="lin")

Calculate the NDVI

As an additional predictor variable, we will calculate the NDVI and directly add it as a new raster band

sen_ms$NDVI <- (sen_ms$B08-sen_ms$B04)/(sen_ms$B08+sen_ms$B04)
plot(sen_ms$NDVI)

Vector data (Response variable)

The vector file is read as sf object. It contains the training sites of 7 Land cover classes. These are polygons (33 in total) that were digitized in QGIS on the basis of the Sentinel data and with support of an aerial image and using expert knowledge. They can be regarded here as a ground truth for the land cover classification.

trainSites <- read_sf("data/trainingsites_muenster.gpkg")
print(trainSites)
## Simple feature collection with 33 features and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 400682 ymin: 5753925 xmax: 408327.9 ymax: 5760113
## Projected CRS: WGS 84 / UTM zone 32N
## # A tibble: 33 × 4
##    ClassID Label         PolygonID                                          geom
##      <dbl> <chr>             <int>                                 <POLYGON [m]>
##  1      11 Planted field         1 ((402789.2 5759326, 402850.9 5759266, 402901…
##  2      11 Planted field         2 ((403154.2 5759459, 403357.6 5759470, 403373…
##  3      11 Planted field         3 ((403124.5 5759788, 403314.6 5759806, 403397…
##  4      14 Inland water          4 ((404809.3 5757129, 404865.6 5757163, 404865…
##  5      14 Inland water          5 ((403729 5756102, 403797.5 5756077, 403770.6…
##  6      14 Inland water          6 ((408139 5759018, 408201.5 5759011, 408189.7…
##  7      14 Inland water          7 ((406992.5 5756470, 407110.9 5756441, 407184…
##  8      14 Inland water          8 ((406334.9 5755042, 406351.1 5755068, 406382…
##  9       4 Grassland             9 ((403302.3 5755592, 403332.2 5755597, 403337…
## 10       4 Grassland            10 ((405602.6 5760078, 405610.9 5760104, 405628…
## # ℹ 23 more rows

Using mapview’s viewRGB function we can visualize the aerial image channels as true color composite in the geographical context and overlay it with the polygons. Click on the polygons to see which land cover class is assigned to a respective polygon.

viewRGB(as(sen_ms,"Raster"), r = 3, g = 2, b = 1, map.types = "Esri.WorldImagery")+
  mapview(trainSites)

Extract raster information

In order to train a machine learning model between the spectral properties and the land cover class, we first need to create a data frame that contains the predictor variables at the location of the training sites as well as the corresponding class information. This data frame can be produced with the extract function. However, first we randomly select pixels within the polygons, otherwise the training data set would be too large.

trainSites$PolygonID <- 1:nrow(trainSites) # lets keep the polygon ID
samplepoints <- st_sample(trainSites,800)
samplepoints <- st_intersection(trainSites,samplepoints)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

Now we can extract and combine with the attributes from the reference sites.

trainDat <- extract(sen_ms, samplepoints, na.rm=FALSE)
trainDat <- cbind(trainDat, samplepoints)
head(trainDat)
##   ID  B02 B03  B04    B08      B06      B07      B11        NDVI ClassID
## 1  1  894 817  492 2046.0 1724.156 1974.125 1514.062  0.61229314       3
## 2  2  846 689  413 3336.0 2289.062 3289.156 1127.781  0.77967458      11
## 3  3  896 740  541 1649.0 1616.562 1822.156 1486.219  0.50593607       3
## 4  4  845 656  395 3299.0 2292.375 3303.469 1113.344  0.78613969      11
## 5  5  882 645  403  303.5  390.375  386.875  164.500 -0.14083510      14
## 6  6 1053 864 1058 1237.0 1135.000 1228.156 1787.375  0.07799564       9
##           Label PolygonID                     geom
## 1  Mixed forest        22 POINT (402811.4 5755982)
## 2 Planted field        25 POINT (401867.7 5755617)
## 3  Mixed forest        22 POINT (402913.6 5755719)
## 4 Planted field        25 POINT (401917.1 5755573)
## 5  Inland water         4 POINT (404746.4 5757103)
## 6         Urban        16 POINT (405773.6 5757690)

Explore the data

boxplot(NDVI~Label,data=trainDat)

featurePlot(x = trainDat[, c("B02","B04","B08","NDVI","B11")], 
            y = factor(trainDat$Label), 
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 3))

Basic model training

Predictors and response

For model training we need to define the predictor and response variables. As predictors we can use basically all information from the satellite image as we might assume they could all be meaningful for the differentiation between the land cover classes. As response variable we use the “Label” column of the data frame.

predictors <- c("B02","B03","B04","B08","B06","B07","B11","NDVI")
response <- "Label"

Model training

We then train a Random Forest model to lean how the classes can be distinguished based on the predictors (note: other algorithms would work as well. See https://topepo.github.io/caret/available-models.html for a list of algorithms available in caret). Caret’s train function is doing this job.

# train the model
set.seed(100)
model <- train(trainDat[,predictors],
               trainDat[,response],
               method="rf",
               importance=TRUE)
print(model)
## Random Forest 
## 
## 800 samples
##   8 predictor
##   7 classes: 'Fallow field', 'Grassland', 'Industrial', 'Inland water', 'Mixed forest', 'Planted field', 'Urban' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9693550  0.9621422
##   5     0.9673158  0.9596287
##   8     0.9659845  0.9579642
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(model))

Model prediction

To perform the classification we can use the trained model and apply it to each pixel of the raster stack using the predict function. Then we can then create a map with meaningful colors of the predicted land cover using the tmap package.

prediction <- predict(sen_ms,model)
cols <- c("sandybrown", "green", "darkred", "blue", "forestgreen", "lightgreen", "red")

tm_shape(prediction) +
  tm_raster(palette = cols,title = "LUC")+
  tm_scale_bar(bg.color="white",bg.alpha=0.75)+
  tm_layout(legend.bg.color = "white",
            legend.bg.alpha = 0.75)

Tuning, validation, variable selection

The code shown above was a basic example of how to train a model and do a classification. But there is more to it. We have to deal with spatial autocorrelation of the data for model tuning and validation and possibly also during a varible selection.

Train Control

Before starting model training we can specify some control settings using trainControl. For hyperparameter tuning (mtry) as well as for error assessment we use a spatial 3-fold cross-validation. Therefore the training data are split into 3 folds but data from the same polygon are always grouped so that they never occur in both, training and testing. Also we make sure that each fold contains data from each land cover class. CAST’s CreateSpacetimeFolds is doing this job when we specify the polygon ID and the class label.

indices <- CreateSpacetimeFolds(trainDat,spacevar = "PolygonID",k=3,class="Label")
ctrl <- trainControl(method="cv", 
                     index = indices$index,
                     savePredictions = TRUE)

We can visualize the “representativness” of the CV folds with plot_geodist:

geodist <- plot_geodist(samplepoints,sen_ms,
              cvfolds = indices$indexOut,
             cvtrain = indices$index)

See how it compares to a random CV:

randomfolds <- createFolds(samplepoints$ClassID, k = 3, list = TRUE, returnTrain = FALSE)

geodist <- plot_geodist(samplepoints,sen_ms,
              cvfolds = randomfolds)

We can also explore how it would look like using NNDM

# Define the modeldomain (here the extent of the satellite image)
modeldomain <- st_as_sf(st_as_sfc(st_bbox(sen_ms)))
modeldomain <- st_transform(modeldomain,st_crs(samplepoints))

#Define folds:
nndmfolds <- knndm(samplepoints,modeldomain=modeldomain)

#plot
geodist <- plot_geodist(samplepoints,sen_ms,
                        cvfolds = nndmfolds$indx_test)

Spatial variable selection

Model training is performed using caret’s train function. However if we want to test, which predictor variables are relevant for making predictions on new spatial locations, we can use the forward feature selection (ffs) function from the CAST package, which is a wrapper around the train function. We specify “rf” as method, indicating that a Random Forest is applied. For model training we reduce the number of trees (ntree) to 75 to speed things up. Note that usually a larger number (>250) is appropriate. We use the Kappa index for validation.

model <- ffs(trainDat[,predictors],
               trainDat[,response],
               method="rf",
               metric="Kappa",
               trControl=ctrl,
               importance=TRUE,
               ntree=75)
plot_ffs(model)

plot_ffs(model,plotType = "selected")

print(model)
## Random Forest 
## 
## 800 samples
##   5 predictor
##   7 classes: 'Fallow field', 'Grassland', 'Industrial', 'Inland water', 'Mixed forest', 'Planted field', 'Urban' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 468, 582, 550 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9370669  0.9183222
##   3     0.9405925  0.9221624
##   5     0.9457047  0.9286068
## 
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.

Prediction using the tuned model

prediction <- predict(sen_ms,model)
cols <- c("sandybrown", "green", "darkred", "blue", "forestgreen", "lightgreen", "red")

tm_shape(prediction) +
  tm_raster(palette = cols,title = "LUC")+
  tm_scale_bar(bg.color="white",bg.alpha=0.75)+
  tm_layout(legend.bg.color = "white",
            legend.bg.alpha = 0.75)

Model validation

When we print the model (see above) we get a summary of the prediction performance as the average Kappa and Accuracy of the three spatial folds. Looking at all cross-validated predictions together we can get the “global” model performance.

# get all cross-validated predictions:
cvPredictions <- model$pred[model$pred$mtry==model$bestTune$mtry,]
# calculate cross table:
table(cvPredictions$pred,cvPredictions$obs)
##                
##                 Fallow field Grassland Industrial Inland water Mixed forest
##   Fallow field           119         0          0            0            3
##   Grassland                0        31          0            0            0
##   Industrial               0         0         49            0            0
##   Inland water             0         0          0           77            0
##   Mixed forest             0         0          0            0          173
##   Planted field            0         3          0            0            8
##   Urban                    0         0          4            0            7
##                
##                 Planted field Urban
##   Fallow field              0     2
##   Grassland                 3     0
##   Industrial                0     2
##   Inland water              0     0
##   Mixed forest              3     9
##   Planted field            93     0
##   Urban                     0   214

Area of Applicability

We have seen that technically, the trained model can be applied to the entire area of interest (and beyond…as long as the sentinel predictors are available which they are, even globally). But we should assess if we SHOULD apply our model to the entire area. The model should only be applied to locations that feature predictor properties that are comparable to those of the training data. If dissimilarity to the training data is larger than the disimmilarity within the training data, the model should not be applied to this location.

The calculation of the AOA is quite time consuming. To make a bit faster we use a parallelization.

AOA <- aoa(sen_ms,model)
## No trainDI provided. Computing DI of training data...
## Computing DI of newdata...
## Computing AOA...
plot(AOA)

plot(AOA$DI)

plot(AOA$AOA)

The result of the aoa function has two layers: the dissimilarity index (DI) and the area of applicability (AOA). The DI can take values from 0 to Inf, where 0 means that a location has predictor properties that are identical to properties observed in the training data. With increasing values the dissimilarity increases. The AOA has only two values: 0 and 1. 0 means that a location is outside the area of applicability, 1 means that the model is inside the area of applicability. Find more information on how the AOA is derived in Meyer&Pebesma (2020).

The figure above shows the predictions ONLY for the AOA (Locations outside the AOA are shown in grey). We see that the model can be applied to most parts of Münster, however there are some locations (especially in the south-west) that are too different in their predictor properties so that we should exclude those predictions from our prediction result.

Summary

  • This tutorial has shown how to perform a remote sensing based land cover classification in R.
  • We identified the area of applicability (AOA) of the trained model to make sure that we don’t make predictions for locations that model has no knowledge about.
  • We transfered the model to a new area and concluded that a transfer is only possible when the model has knowledge about the new environment. Again, the AOA method was applied to identify the unknown locations.

Get further help

For further help on handling of raster and vector data in R see e.g. https://geocompr.github.io/. More information on machine learning: e.g. https://link.springer.com/book/10.1007/978-1-4614-6849-3. More information on the relevance of spatial validation strategies as well on the AOA can be found in recordings from OpenGeoHub (https://www.uni-muenster.de/RemoteSensing/lehre/summer_schools/) or in the Tutorials of the CAST package: https://hannameyer.github.io/CAST/